New alphabet— dependent morphological transition in a random RNA alignment 
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We study the fraction / of nucleotides involved in the formation of a cactus-like secondary struc- 
ture of random heteropolymer RNA-like molecules. In the low-temperature limit we study this 
fraction as a function of the number c of different nucleotide species. We show, that with changing 
c, the secondary structures of random RNAs undergo a morphological transition: /(c) — > 1 for 
c < c cr as the chain length n goes to infinity, signaling the formation of a virtually "perfect" gapless 
secondary structure; while /(c) < 1 for c > c cr , what means that a non-perfect structure with gaps 
is formed. The strict upper and lower bounds 2 < c cr < 4 are proven, and the numerical evidence 
for c cr is presented. The relevance of the transition from the evolutional point of view is discussed. 



Genetic information in all life cells is kept within the 
primary sequences of DNA and RNA molecules. Both 
of them are heteropolymers consisting of four different 
nucleotide types. Why does nature use exactly four 
aminoacid bases? Could one find any properties of sys- 
tems containing DNAs or RNAs sensitive to the number 
of different "letters" (i.e. different nucleotide types) used 
in construction of these heteropolymers? Typically, the 
attempts to answer this question are based on the chem- 
istry of interacting nucleotides or deal with the con- 
jectures lying in the general information theory Q. Here 
we present a statistical observation concerning the depen- 
dence of the RNA secondary structures on the number 
of nucleotide types (alphabet size), c, which, to the best 
of our knowledge, was never discussed before. 

We demonstrate the existence of a morphological tran- 
sition in the statistics of the secondary structure of a ran- 
dom RNA-like chain as a function of the alphabet size, 
c. Namely, for small c, c < c cr long enough chains can 
form a "perfect" secondary structure, i.e. a structure in 
which the fraction of paired nucleotides (i.e. connected to 
the complementary ones via hydrogen bonds) approaches 
one as the chain length goes to infinity, while for c > c cr 
even the best possible secondary structure includes a fi- 
nite fraction of gaps (i.e., nucleotides which have nobody 
to connect with). 

Note that this problem belongs to the class of satisfia- 
bility ones, such as the celebrated fc-SAT problem 
Indeed, we are looking for a transition from a situation 
when some problem (in our case, a search for a perfect 
secondary structure) is almost surely solvable for any ran- 
dom initial conditions (nucleotide sequences) to the situ- 
ation when it is almost surely unsolvable. This transition 
occurs with a change of alphabet size, i.e. c plays a role 
analogous to M/N (a ratio of the number of equations 
to the number of variables) in fc-SAT. 

As for the particular value of the critical alphabet size, 
at which the transition occurs, we prove that it lies in the 



the upper, c™ ax = 4, bounds can be computed exactly. 
The numerical estimates of c cr are rather restrictive since 
c takes integer values only. However we argue below that 
with some minor modification, the problem under con- 
sideration can be naturally generalized to non-integer c. 
Numerical evaluation of this generalized problem leads 
to the value c" um w 2.7. 
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Figure 1: (a) Secondary structure of an RNA gene HAR1F 
(see, for example, 0]); (b) and (c) Schematic cactus-like 
structures of RNA-like chains represented in Figf2]by Motzkin 
and Dyck paths respectively. 

The RNA's secondary structure prediction deals with 
a search for the structure with the lowest value of the 
free energy among all allowed cactus-like structures. Nu- 
merous dynamic programming algorithms (DPA) are de- 
veloped to that end Q. In the simplest possible case 
one supposes that a given chain consists of n monomer 
units, each unit chosen from a set of c different types (let- 
ters) A, B, C, D,.... These units can form non-covalent 
bonds with each other, at most one bond per unit. The 
energy of a bond depends on which letters are bonded, 
for example, one can assign an attraction energy u to 
the bonds between similar letters (A-A, B-B, etc., we 
call them "matches") and zero energy - to the bonds 
between different letters (A-B, A-D, etc, "mismatches") 
[23l ] . The topology of secondary structures is supposed to 
be "cactus-like" , i.e. hierarchically folded and topolog- 
ically isomorphic to a tree (we suppose here that struc- 
tures which do not have the tree like, known as "pscudo- 
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knots", are suppressed). To simplify the model as much 
as possible we do not allow here for any constraints on 
the minimal size of loops in the structure, the variation 
in the energies of different types of matchings, nor for the 
contribution of loop factors to the partition function, or 
the stacking interactions (the cooperativity in formation 
of bonds between adjacent pairs of monomers). Despite 
these essential simplifications, the considered model is 
known to be a common "firing ground" for theoretical 
consideration of secondary structures formed in the en- 
semble of messenger RNAs Q. 

The partition function of the random RNA-type het- 
eropolymer is known (see, for example, @ in the context 
of matching models) to satisfy the recursion: 
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These equations (the analogues of which are used all 
over the place in the RNA-folding theory, see, for exam- 
ple, [10l - [l3| ) generate the hierarchical cactus-like RNA 
topology. The term g^j describes the statistical weight 
of the part of the sequence between monomers i and j. 
The Boltzmann weights j3i j (1 < i < j < n) are the sta- 
tistical weights of bonds: = e u l T if i and j match, 
and (3ij = 1 otherwise. Now, energy of the ground state 
is just a limiting value of the free energy as the temper- 
ature approaches After some 

algebra (see 0] for details) one can reduce the expression 
for E to the following form: 



E ii+k = lim T\ng ii+k = max \E i+1 , i+k , 
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where e^j is the interaction energy between monomers i 
and j, it equals u if these monomers match, and if they 
do not. 

For random RNA-type sequence made of c letters, the 
average energy of a ground state, Ei <n , for n 3> 1 be- 
haves as (E) ~ Tp/(c), where /(c) is the fraction of 
nucleotides which have formed bonds. We argue that 
for c less than a certain value c cr there exists a "per- 
fect" match, i.e., /(c) = 1, and the fraction of connected 
monomers converges to unity, while for c < c cr a finite 
fraction of monomers f — /(c) remains unmatched even 
in the best match, i.e. /(c) < 1. Below we compute the 
exact lower and upper bounds for c cr , derive the upper 
bound for /(c) in the c > c cr -region, and discuss the 
numerical evidence of our conjecture. 

Consider c™ m = 2. It turns out that matching with 
/(c = 2) — > I as n — > oo is possible not only on aver- 
age but for any given primary structure. Indeed, con- 
sider a random heterpolymer RNA constituted of A and 
B monomers, forming saturating bonds of type A-A and 



B-B and construct the optimal structure as follows. Take 
the left end of the chain as a starting point, and move 
along a sequence until meeting the first pair of two se- 
quential letters AA or BB. Connect these two letters with 
a bond and erase them from the sequence. Iterating this 
procedure, one arrives finally to an alternating sequence 
of the type ABAB... (we have assumed that the starting 
letter is A). Connect now the first letter A from this se- 
quence to the last one, the next B to the B before the last 
A, etc. It is clear that this algorithm results in a nested 
secondary structure which leaves unmatched at most two 
letters (one - in the middle of the ABAB... -sequence and, 
possibly, another one in the very end). The fraction of 
mismatched letters decreases as n^ 1 with n, proving the 
conjecture. A similar algorithm for alternating (A-B) 
bonding can be easily constructed (though the fraction 
of mismatches decreases as n -1 / 2 in this case). Note, that 
this lower bound is already nontrivial: in the celebrated 
"longest common subsequence" used for the comparison 
of two linear DNA sequences, the fraction of matches 
equals /n„(c = 2) = 2(\/2 — 1) < I , and the "critical" 
alphabet size, at which fy in = 1, is cj. 1 " = 1. 

To construct the upper bound for c cr , recall the one-to- 
one mapping between cactus-like RNA secondary struc- 
tures and discretized Brownian excursions, known as 
Motzkin paths [l4|. Under this mapping, shown in FigJ^l 
the gapless ("perfect") secondary structures correspond 
to excursions with no horizontal steps, the Dyck paths. 
The total number D{n) of Dyck paths of even length n 
is given by a Catalan number C n /2- 
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where T[n] is the T- function, and the asymptotic expres- 
sion is valid for n 3> 1. 




Figure 2: The secondary structure of RNAs: (a) with gaps 
represented by Motzkin path; (b) gapless represented by Dyck 
path. 



Consider a set of random sequences of length n. Each 
of these sequences (there are c™ of them) must corre- 
spond to a certain perfect match, i.e. a Dyck path. 
Meanwhile, if one particular Dyck path corresponds to 
a perfect match of some particular sequence, it simulta- 
neously corresponds to perfect matches of many others. 
Indeed, each "up-down" pair of steps in a Dyck path can 
be realized in c different ways (A-A, B-B, etc..) inde- 
pendently of all others, leading to a degeneracy of order 
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Thus, the number of different primary sequences where f n 



which can have perfect secondary structures is at most 



W(c, n) = c n ' 2 D(n) 
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One primary sequence can be represented by several 
Dyck paths, thus this is an estimate from above. Com- 
paring the value W(c, n) to the total number of primary 
sequences, Wo(c, n) = c", we have for n^> 1: 

\nW(c,n) lnWp(n) max 
hm > hm — for c < c" = 4 



lim 
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One can follow this reasoning to develop the upper 
bound also for /(c) at c > c™ ax = 4. In this case the 
fraction of random primary sequences admitting a per- 
fect match among all Wo (c, n) of them is exponentially 
small. Therefore, the ground states of almost all of se- 
quences should correspond to matchings with gaps, i.e. 
to Motzkin paths. The Motzkin paths with finite fraction 
of gaps (horizontal steps) produce much more possibili- 
ties for the RNA ground states than Dyck paths of the 
same length. The number of n-step Motzkin paths with 
to gaps is M(m, n) = m , { "L m y D(n - to) and 
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and ((5J) works for n 3> 1 and even m — n. 

How many different primary structures can have a 
given Motzkin path as a ground state? Each pair of "up- 
down" steps is bound to belong to the same species, as 
for the Dyck paths, while each horizontal step can be 
chosen independently. The total degeneracy Z is thus 



Z(c,n,f) 
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As / decreases, the total number of structures which 
can have ground states with the fraction of matches more 
than / increases and is given by 



W(c,n,f) 
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At some / it becomes equal to the total number of 
possible primary structures Wo(c, n) = c™, giving the 
estimate for the typical value of /(c). For n 3> 1 
the sum in ((8]) can be evaluated up to the leading or- 
der using the saddle-point approximation. One has for 
Aw(f,c) = lim ri 



1 , W{c,nJ) . 
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For / < / m the sum in (JS]) is domi- 



nated by contribution from the upper boundary, while for 
/ < /m it is given by the maximum at / m and is, there- 
fore, independent of the upper summation limit. The 
desired value of /(c) is defined by the solution of the 
equation Aw(f, c) = and is plotted in Figj3] with a 
dotted line I24J1 . 
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Figure 3: The average fraction of connected nucleotides in 
the ground state secondary structure, /, as a function of the 
number of nucleotide types, c. A - the numeric dependence 
obtained as an average value for 500 randomly generated se- 
quences of length n — 200 each, • - the approximation to 
chains of infinite length, ■ - upper analytic estimation @ 
(the line connecting the points are a guide for the eyes), inset 
shows the finite-size scaling used to obtain the data at infinite 
lengths. 

We have analyzed numerically the statistical proper- 
ties of the ground state free energy E n (c), applying ([2]) 
to the random sequences with different numbers of let- 
ters (nucleotide types) c. In FigJ3] we show the numeric 
results for the average value of /(c) for c = 2,..., 8 for 
sequences of finite length n = 200, as well as the limiting 
values extrapolated to n — > oo. The experimental val- 
ues of /(c) lay lower then the upper bound given by the 
theory. To go beyond the integer values of c one can use 
the analogue with the linear Bernoulli matching prob- 
lem (j~7j and replace the correlated matrix £jj with the 
uncorrelated random matrix, whose entries are indepen- 
dent randomly distributed variables taking the value u 
with the probability c _1 and otherwise. The numerical 
results seem to show that this simplified model belongs to 
the same universality class, and the change in /(c) due to 
the removal of correlations in adjacency matrix is lower 
than 1%. Moreover, in the Bernoulli case, the generaliza- 
tion to the non-integer values of c is straightforward and 
the numerical simulations show that the transition from 
perfect to non-perfect match occurs at c w 2.7. More 
details on the Bernoulli RNA-like matching will be pro- 
vided elsewhere [l8l |. 

Summing up, we demonstrate here that alphabets with 
different number of letters, c, are nonequivalent if one 
considers the matching problem of long random RNA. 
This nonequivalcncc is tightly coupled to the restric- 
tions on the morphology of allowed secondary structures. 
Indeed, the existence of two regimes (for c < c cr and 
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c > c cr ) is a peculiarity of RNAs and is due to the addi- 
tional freedom in the formation of the complex cactus- 
like secondary structures typical for messenger RNAs. 
For linear matching problem used in DNA comparison, 
the fraction of nucleotides in the optimal alignment is less 
than 1 for any alphabet with c > 1. In our model the 
transition between two regimes occurs at 2 < c cr < 4. 
The exact value of the critical alphabet size should be 
sensitive to the microscopic details of the model, and one 
can enumerate factors which are neglected in our model 
and which could shift the transition point to the right 
or to the left from the observed critical value. On the 
one hand, the presence of stacking energies and minimal 
loop sizes in real RNA leads to the bonds being effec- 
tively formed not by single nucleotides, but by blocks of 
them, increasing the effective alphabet size for given c, 
thus, decreasing c cr in terms of the size of a "bare" alpha- 
bet. On the other hand, one would not expect any real- 
life random RNA to have a completely random struc- 
ture with exactly equal concentrations of letters and no 
short-range correlations between them. Any such corre- 
lations reduce the information entropy of the sequence, 
and, therefore, lead to the decrease of the effective al- 
phabet size, and thus, push c cr to higher values. The 
exact value of c cr is non-universal. However our analysis 
shows: (i) the existence of two different morphological 
regimes, depending on the number of nucleotide types in 
the alphabet, and (ii) the fact that this transition point 
can plausibly be rather close to 4. 



This particular number, obviously, sounds suggestive 
since it is exactly the number of nucleotide types in the 
alphabet used in real- world RNAs. The criticality on al- 
phabet size, observed only for RNAs thus nicely rhymes 
with the modern opinion that the life originates from the 
template-directed replication of random RNA molecules 
(the so-called "RNA world" hypothesis) [H Hjj. Can 
it be indeed advantageous to have the alphabet of criti- 
cal or closc-to-critical size? For RNA to have a biologi- 
cal function it should: i) fold predictably, and ii) form 
a robust structure not too sensitive to thermal noise. 
Short nucleotide alphabets with c < c cr tend to pro- 
duce structures which have many different ground states 
(see (151), also compare with similar reasoning for proteins 
|2l|,[22l). On the other hand, long alphabets correspond 
to loosely bound ground states with many unpaired nu- 
cleotides, which is disadvantageous in terms of stability 
of the structure. The critical alphabets, thus, seem to be 
optimal for biological purpose. 
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